We will re-use the colon cancer data in GSE33126

Importing the data

library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "GSE33126_series_matrix.txt.gz"
if(!file.exists("GSE33126_series_matrix.txt.gz")) download.file(paste(url, filenm, sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
## File stored at: 
## /tmp/RtmpsdXCPn/GPL6947.soft
colonData
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48803 features, 18 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
##   varLabels: title geo_accession ... data_row_count (31 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803
##     total)
##   fvarLabels: ID nuID ... GB_ACC (30 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6947
exprs (colonData) <- log2 (exprs(colonData))
SampleGroup <- pData(colonData)$source_name_ch1
Patient <- pData(colonData)$characteristics_ch1.1

Filtering the data

A first step in the analysis of microarray data is often to remove any uniformative probes. We can do this because typically only 50% of probes genes will be expressed, and even fewer than this will be differentially expressed. Including such non-informative genes in visualisa- tion will obscure any biological differences that exist. The genefilter package contains a suite of tools for the filtering of microarray data. The varFilter function allows probes with low- variance to be removed from the dataset. The metric using to decide which probes to remove is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the function used to do the filtering, and cut-off can be specified by the user.

library (genefilter)
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
dim (colonData)
## Features  Samples 
##    48803       18
varFiltered <- varFilter (colonData)
dim (varFiltered)
## Features  Samples 
##    24401       18
nrow (colonData) / nrow (varFiltered)
## Features 
## 2.000041

Calculating a distance matrix

The first step towards clustering the data is to construct a distance matrix. Each entry in this matrix is the pairwise distance between two samples. Note that for expression data we have to transpose the standard ExpressionSet format, as clustering is designed to work on rows rather than columns. The standard function to make a distance matrix in R is dist which uses the euclidean metric. As the data entries in a distance matrix are symmetrical, it has a different representation to the default matrix (i.e. values are not repeated unnecessar- ily), and clustering algorithms are able to accept this representation as input.

euc.dist <- dist (t(exprs(varFiltered)))
euc.dist
##           GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820517 146.35076                                                  
## GSM820518 112.10192 145.53738                                        
## GSM820519 127.02787 111.88327 128.89085                              
## GSM820520 103.88617 145.88415  93.26812 124.41043                    
## GSM820521 115.74149 109.59666 112.44568  77.60015 110.62329          
## GSM820522  86.59678 140.12151 115.68936 131.12234  98.96944 117.26705
## GSM820523 118.07552 115.02450 139.86229 101.32561 134.43368  89.05971
## GSM820524 113.05465 135.64296 106.83091 126.25701  85.61877 117.19595
## GSM820525 142.82777  72.50357 145.43070 101.26515 138.18094 107.93645
## GSM820526  87.68842 160.67923 109.56682 126.28151 102.83077 114.59309
## GSM820527 102.51958 121.92622 119.62366  78.01042 110.89035  69.13310
## GSM820528  85.71746 132.44295  96.04791 115.36869  75.14097 103.00987
## GSM820529 107.07671 118.90261 120.08406  94.25470 114.46866  79.01849
## GSM820530  80.36987 151.46130  94.29680 123.97811  86.59315 103.45768
## GSM820531 108.26678 100.31635 115.03569  82.27479 112.72244  64.87145
## GSM820532 128.93571  95.33138 115.04522 106.11806 116.81059 102.43630
## GSM820533 106.21944 134.12719 115.78666  97.39843 115.37280  68.75089
##           GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820517                                                            
## GSM820518                                                            
## GSM820519                                                            
## GSM820520                                                            
## GSM820521                                                            
## GSM820522                                                            
## GSM820523 105.62731                                                  
## GSM820524  85.85654 116.10403                                        
## GSM820525 127.62848  99.13618 120.46802                              
## GSM820526 107.40340 134.03152 114.20701 154.74502                    
## GSM820527 106.32390  77.38321 115.56748 109.20371 104.05182          
## GSM820528  89.80380 112.44894  90.45972 124.46834  97.24261  90.59615
## GSM820529 116.33941  83.80927 123.55735 107.46869 119.74686  59.39630
## GSM820530  97.19378 126.82149 107.83406 146.18764  87.43164 101.35185
## GSM820531 116.26826  91.51850 122.61883 103.07141 119.87820  70.55040
## GSM820532 124.92807 122.94258 116.57744 103.39879 137.24907 113.49486
## GSM820533 116.89886  93.17867 127.75562 130.21971 109.87582  64.16850
##           GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
## GSM820517                                                  
## GSM820518                                                  
## GSM820519                                                  
## GSM820520                                                  
## GSM820521                                                  
## GSM820522                                                  
## GSM820523                                                  
## GSM820524                                                  
## GSM820525                                                  
## GSM820526                                                  
## GSM820527                                                  
## GSM820528                                                  
## GSM820529  86.75871                                        
## GSM820530  81.22235 103.68352                              
## GSM820531  96.07646  70.24274  99.39146                    
## GSM820532 105.14902 109.33036 120.61473  93.09175          
## GSM820533  99.91990  68.56935  96.27636  73.49531 119.19071

For gene-expression data, it is common to use correlation as a distance metric rather than the Euclidean. You should make sure that you know the difference between the two metrics. The cor function can be used to calculate the correlation of columns in a matrix. Each row (or column) in the resulting matrix is the correlation of that sample with all other samples in the dataset. The matrix is symmetrical and we can transform this into a distance matrix by first subtracting 1 from the correlation matrix. Hence, samples with a higher correlation have a smaller ’distance’.

corMat <- cor(exprs(varFiltered))
corMat
##           GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820516 1.0000000 0.8582704 0.9171976 0.8937222 0.9286023 0.9112575
## GSM820517 0.8582704 1.0000000 0.8600812 0.9173538 0.8588151 0.9202012
## GSM820518 0.9171976 0.8600812 1.0000000 0.8907537 0.9425576 0.9163944
## GSM820519 0.8937222 0.9173538 0.8907537 1.0000000 0.8978211 0.9602207
## GSM820520 0.9286023 0.8588151 0.9425576 0.8978211 1.0000000 0.9187180
## GSM820521 0.9112575 0.9202012 0.9163944 0.9602207 0.9187180 1.0000000
## GSM820522 0.9503610 0.8696543 0.9115547 0.8864231 0.9349881 0.9085925
## GSM820523 0.9081569 0.9126298 0.8713382 0.9324985 0.8806670 0.9475820
## GSM820524 0.9156037 0.8781899 0.9247661 0.8949583 0.9514800 0.9089643
## GSM820525 0.8648318 0.9650722 0.8601046 0.9322218 0.8731558 0.9224893
## GSM820526 0.9493074 0.8293478 0.9209805 0.8950744 0.9301280 0.9131154
## GSM820527 0.9302064 0.9009531 0.9051381 0.9597308 0.9180982 0.9681213
## GSM820528 0.9511778 0.8829546 0.9388145 0.9117054 0.9623691 0.9291090
## GSM820529 0.9241083 0.9061505 0.9047172 0.9413361 0.9130390 0.9584997
## GSM820530 0.9573796 0.8482370 0.9414236 0.8987844 0.9504108 0.9291209
## GSM820531 0.9223823 0.9331725 0.9125297 0.9552921 0.9156400 0.9720189
## GSM820532 0.8897132 0.9395367 0.9123688 0.9254827 0.9092395 0.9300936
## GSM820533 0.9255052 0.8809038 0.9116292 0.9374967 0.9118986 0.9686817
##           GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820516 0.9503610 0.9081569 0.9156037 0.8648318 0.9493074 0.9302064
## GSM820517 0.8696543 0.9126298 0.8781899 0.9650722 0.8293478 0.9009531
## GSM820518 0.9115547 0.8713382 0.9247661 0.8601046 0.9209805 0.9051381
## GSM820519 0.8864231 0.9324985 0.8949583 0.9322218 0.8950744 0.9597308
## GSM820520 0.9349881 0.8806670 0.9514800 0.8731558 0.9301280 0.9180982
## GSM820521 0.9085925 0.9475820 0.9089643 0.9224893 0.9131154 0.9681213
## GSM820522 1.0000000 0.9262929 0.9511794 0.8917089 0.9237271 0.9246435
## GSM820523 0.9262929 1.0000000 0.9111565 0.9350314 0.8817770 0.9603706
## GSM820524 0.9511794 0.9111565 1.0000000 0.9037952 0.9139681 0.9112436
## GSM820525 0.8917089 0.9350314 0.9037952 1.0000000 0.8415122 0.9204252
## GSM820526 0.9237271 0.8817770 0.9139681 0.8415122 1.0000000 0.9282016
## GSM820527 0.9246435 0.9603706 0.9112436 0.9204252 0.9282016 1.0000000
## GSM820528 0.9461775 0.9161095 0.9455812 0.8964670 0.9372344 0.9449786
## GSM820529 0.9101066 0.9536168 0.8988913 0.9232237 0.9051941 0.9764954
## GSM820530 0.9374851 0.8940675 0.9232363 0.8584350 0.9496136 0.9318156
## GSM820531 0.9101826 0.9446634 0.9003839 0.9293514 0.9049499 0.9668194
## GSM820532 0.8961047 0.8999407 0.9097986 0.9287621 0.8751776 0.9139221
## GSM820533 0.9094910 0.9427849 0.8921813 0.8875923 0.9203758 0.9726729
##           GSM820528 GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
## GSM820516 0.9511778 0.9241083 0.9573796 0.9223823 0.8897132 0.9255052
## GSM820517 0.8829546 0.9061505 0.8482370 0.9331725 0.9395367 0.8809038
## GSM820518 0.9388145 0.9047172 0.9414236 0.9125297 0.9123688 0.9116292
## GSM820519 0.9117054 0.9413361 0.8987844 0.9552921 0.9254827 0.9374967
## GSM820520 0.9623691 0.9130390 0.9504108 0.9156400 0.9092395 0.9118986
## GSM820521 0.9291090 0.9584997 0.9291209 0.9720189 0.9300936 0.9686817
## GSM820522 0.9461775 0.9101066 0.9374851 0.9101826 0.8961047 0.9094910
## GSM820523 0.9161095 0.9536168 0.8940675 0.9446634 0.8999407 0.9427849
## GSM820524 0.9455812 0.8988913 0.9232363 0.9003839 0.9097986 0.8921813
## GSM820525 0.8964670 0.9232237 0.8584350 0.9293514 0.9287621 0.8875923
## GSM820526 0.9372344 0.9051941 0.9496136 0.9049499 0.8751776 0.9203758
## GSM820527 0.9449786 0.9764954 0.9318156 0.9668194 0.9139221 0.9726729
## GSM820528 1.0000000 0.9497681 0.9561975 0.9383683 0.9260060 0.9335861
## GSM820529 0.9497681 1.0000000 0.9288649 0.9672203 0.9204350 0.9688689
## GSM820530 0.9561975 0.9288649 1.0000000 0.9346090 0.9035232 0.9388144
## GSM820531 0.9383683 0.9672203 0.9346090 1.0000000 0.9422943 0.9642204
## GSM820532 0.9260060 0.9204350 0.9035232 0.9422943 1.0000000 0.9057118
## GSM820533 0.9335861 0.9688689 0.9388144 0.9642204 0.9057118 1.0000000
cor.dist <- as.dist(1 - corMat)

Hierachical clustering

Hierachical clustering is the method by which samples with similar profiles are grouped to- gether, based on their distances. The method for grouping samples can be specified, with the default being to use complete linkage. Different linkage methods can affect the properties of the clustering output. e.g. the ’ward’ method tends to produce smaller, compact clusters. A popular way of displaying the results of clustering is a dendrogram, which arranges the sam- ples on the x-axis and shows the distances between samples on the y-axis. It is important to note that the distance of samples on the x-axis has no meaning. The fact that two samples are displayed next to each other, does not m

par(mfrow = c (1 , 2))
clust.euclid = hclust(euc.dist)
clust.cor = hclust (cor.dist)
par (mfrow = c (1 , 2))
plot(clust.euclid , label = SampleGroup )
plot(clust.cor , label = SampleGroup )

Extracting data from the clustering

If we want to interpret the data presented in a clustering analysis, we need a way of extract- ing which samples are grouped together, or to determine the optimal grouping of samples. One intuitive way of assigning groups it to ’cut’ the dendrogram at a particular height on the y-axis. We can do this manually on the plot, or use the cutree function to return the labels of samples that are belong to the same group when the dendrogram is cut at the spec- ified height, h. Alternatively, we can specify how many groups, k, that we want to create.

library (cluster)
par (mfrow = c(1 , 1))
plot(clust.cor)
abline (h = 0.12, col = " red ")

cutree (clust.cor , h = 0.12)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
cutree (clust.cor , k = 2)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
table (cutree(clust.cor , k = 3) , SampleGroup)
##    SampleGroup
##     normal tumor
##   1      0     8
##   2      2     1
##   3      7     0

A Silhouette plot can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it ’fits’ the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. How- ever, if the value is around 0 the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.

par (mfrow = c (2 , 2))
plot (silhouette(cutree(clust.cor, k=2),cor.dist),
      col="red",main=paste("k=",2))
plot (silhouette(cutree(clust.cor, k=3),cor.dist),
      col="red",main=paste("k=",3))
plot (silhouette(cutree(clust.cor, k=4),cor.dist),
      col="red",main=paste("k=",4))
plot (silhouette(cutree(clust.cor, k=5),cor.dist),
      col="red",main=paste("k=",5))

If we have prior knowledge of how many clusters to expect, we could run the clustering in a supervised manner. The Partition Around Medioids method can be used to group samples into k clusters.

pam.clus <- pam (euc.dist , k = 2)
clusplot (pam.clus)

pam.clus$clustering
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 
##         1         2         1         2         1         2         1 
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 
##         2         1         2         1         2         1         2 
## GSM820530 GSM820531 GSM820532 GSM820533 
##         1         2         2         2
table(pam.clus$clustering , SampleGroup)
##    SampleGroup
##     normal tumor
##   1      0     8
##   2      9     1

Producing a heatmap

A heatmap is often used to visualise differences between samples. Each row represents a gene and each column is an array and coloured cells indicate the expression levels of genes. Both samples and genes with similar expression profile are clustered together. By default, euclidean distances are used with complete linkage clustering. Drawing a heatmap in R uses a lot of memory and can take a long time, therefore reduc- ing the amount of data to be plotted is usually recommended. Including too many non- informative genes can also make it difficult to spot patterns. Typically, data are filtered to include the genes which tell us the most about the biological variation. In an un-supervised setting, the selection of such genes is done without using prior knowledge about the sample groupings.

IQRs = apply (exprs(varFiltered) , 1 , IQR )
highVarGenes = order (IQRs, decreasing = T )[1:100]
Symbols <- as.character(fData(colonData)$Symbol[highVarGenes])
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]),
         labCol = SampleGroup , labRow = Symbols)

The default options for the heatmap are to cluster both the genes (rows) and samples (columns). However, sometimes we might want to specify a particular order. For example, we might want to order the columns according to sample groups. We can do this by re-ordering the input matrix manually and setting the Colv to NA. This tells the heatmap function not be cluster the columns. It choosing this option, we need to be careful that any column colour- ings or labels are set to the same ordering. Alternatively, a pre-calculated dendrogram could be used.

clus.ward <- hclust (cor.dist , method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
         Colv = as.dendrogram(clus.ward) , labCol = SampleGroup )

Customising the heatmap

The heatmap function can be customised in many ways to make the output more informa- tive. For example, the labRow and labCol parameters can be used to give labels to the rows (genes) and columns (sample) of the heatmap. Similarly, ColSideColors and RowSideCol- ors give coloured labels, often used to indicate different groups which are know in advance. See the help page for heatmap for more details.

labs <- as.factor(SampleGroup)
levels(labs) <- c ("yellow" , "blue")
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
        labCol = Patient, ColSideColors = as.character(labs),
        labRow = Symbols)

The colours used to display the gene expression values can also be modified. For this, we can use the RColorBrewer package which has functions for creating pre-defined palettes. The function display.brewer.all can be used to display the palettes available through this package.

library (RColorBrewer)
display.brewer.all()

hmcol <- brewer.pal(11 , "RdBu")
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
  ColSideColors = as.character(labs) , labRow = Symbols,
  col=hmcol)

You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret

A popular use for heatmaps is to take an existing gene list (e.g. genes found to be significant in a previous study, or genes belonging to a particular pathway) and produce an image of how they cluster the data for exploratory purposes. This can be achieved easily by selecting the correct rows in the data matrix.

library (illuminaHumanv3.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
pathwayGenes <- unlist (mget("04110" , revmap(illuminaHumanv3PATH)))
pathwayGenes <- pathwayGenes [pathwayGenes %in% featureNames(varFiltered)]
symbols <- fData(varFiltered)[pathwayGenes , "Symbol"]
heatmap (as.matrix(exprs(varFiltered)[pathwayGenes , ]),
  ColSideColors = as.character (labs) , labCol = Patient ,
  labRow = symbols , col = hmcol )

Principal Components Analysis

Principal components analysis (PCA) is a data reduction technique that allows us to sim- plify multidimensional data sets to 2 or 3 dimensions for plotting purposes and identify which factors explain the most variability in the data. We can use the prcomp function to perform a PCA and we have to supply it with a distance matrix. The resulting object contains infor- mation on the proportion of variance that is ’explained’ by each component. Ideally, we want to see that the majority of the variance is explained by the first 2 or 3 components, and that these components are associated with a biological factor

pca <- prcomp(exprs(varFiltered))
plot(pca)

summary(pca)
## Importance of components:
##                          PC1     PC2     PC3    PC4     PC5     PC6
## Standard deviation     7.171 1.13645 0.84320 0.6838 0.53334 0.50432
## Proportion of Variance 0.924 0.02321 0.01278 0.0084 0.00511 0.00457
## Cumulative Proportion  0.924 0.94724 0.96002 0.9684 0.97353 0.97810
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.44891 0.40378 0.38232 0.36830 0.32469 0.30894
## Proportion of Variance 0.00362 0.00293 0.00263 0.00244 0.00189 0.00172
## Cumulative Proportion  0.98172 0.98465 0.98728 0.98972 0.99161 0.99333
##                           PC13    PC14    PC15    PC16    PC17   PC18
## Standard deviation     0.27589 0.27127 0.25520 0.24435 0.22922 0.2104
## Proportion of Variance 0.00137 0.00132 0.00117 0.00107 0.00094 0.0008
## Cumulative Proportion  0.99469 0.99602 0.99719 0.99826 0.99920 1.0000

The ggplot2 package is convenient for plotting principal components results as it allows many different covariates to be overlaid on the plot. See http://ggplot2.org/ for more informa- tion on this package.

library(ggplot2)
clusLabs <- cutree(clust.cor , k = 3)
pcRes <- data.frame(pca$rotation , SampleGroup , Sample = Patient)
ggplot (pcRes , aes(x = PC1 , y = PC2 , col = SampleGroup ,
                       label = Patient , pch = as.factor(clusLabs))) + geom_point () +
      geom_text(vjust=0,alpha=0.5)

Gene-Ontology Analysis

In this section we give an example of how to find a list of relevant pathways / GO terms from a list of differentially-expressed genes. We will use the colon cancer data that we down- loaded from GEO.

We are now going to create a gene universe by removing genes for will not contribute to the subsequent analysis. Such filtering is done without regarding the phenotype variables - hence a ”non-specific” filter. An Illumina Human6 chip contains around 48,00 probes, but less than half of these have enough detailed information to useful for a GO analysis. Therefore we re- strict the dataset to only probes for which we have a Entrez ID. It is also recommended to select probes with sufficient variability across samples to be interesting; as probes with little variability will no be interesting to the question we are trying to answer. The interquartile- range of each probe across all arrays is commonly used for this with a cut-off of 0.5.

library (illuminaHumanv3.db)
entrezIds = mget(rownames(exprs(colonData)) , illuminaHumanv3ENTREZID ,
  ifnotfound = NA )

haveEntrezId = names(entrezIds)[sapply(entrezIds , function (x) !is.na(x))]

entrezSubset = exprs (colonData)[haveEntrezId , ]

entrezIQR = apply(entrezSubset, 1 , IQR)

selected = entrezIQR > 0.5

nsFiltered = entrezSubset [selected ,]

universeIds = unlist (mget(rownames(nsFiltered ) , illuminaHumanv3ENTREZID ,
  ifnotfound = NA ))

Remember that the size of the universe can have an effect on the analysis. If the universe is made artificially large by including too many uninformative probes, the p-values for the GO terms will appear more significant.

We now test the genes in the universe to see which ones have significant differences between the two groups. For this, we use the rowttests function implemented in the genefilter pack- age, which performs a t-test for each row with respect to a factor. The p-values of the test can be extracted, with one p-value given for each probe.

library (GOstats)
## Loading required package: Category
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following object is masked from 'package:IRanges':
## 
##     expand
## 
## Loading required package: GO.db
## 
## Loading required package: graph
## 
## Attaching package: 'GOstats'
## 
## The following object is masked from 'package:AnnotationDbi':
## 
##     makeGOGraph
library (genefilter)
fac = as.factor(SampleGroup)
ttests = rowttests(as.matrix(nsFiltered) , fac)

smPV = ttests$p.value < 0.05

pvalFiltered = nsFiltered [smPV , ]

dim (pvalFiltered)
## [1] 3799   18
selectedEntrezIds = unlist (mget(rownames(pvalFiltered) ,
  illuminaHumanv3ENTREZID , ifnotfound = NA))

The hyperGTest function is used to do the hypergeometric test for GO terms. Rather than passing a long list of parameters to the function. An object of type GOHyperGParams is created to hold all the parameters we need to run the hypergeometric test. This object can then be passed to hyperGTest multiple times without having to re-type the parameters each time. The meanings of these parameters are as follows:

params = new ("GOHyperGParams" , geneIds = selectedEntrezIds , 
              universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
              ontology =  "BP" , pvalueCutoff = 0.05 , conditional = FALSE ,
              testDirection = "over")
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
hgOver = hyperGTest(params)
hgOver
## Gene to GO BP  test for over-representation 
## 10029 GO BP ids tested (564 have p < 0.05)
## Selected gene set size: 2859 
##     Gene universe size: 6588 
##     Annotation package: illuminaHumanv3

The summary function can be used to view the results of the test in matrix form. The rows of the matrix are arranged in order of significance. The p-value is shown for each GO term along with with total number of genes for that GO term, number of genes we would be ex- pect to appear in the gene list by chance and that number that were observed. A descriptive name is also given for each term. The results can also be printed out to a HTML report us- ing htmlReport.

summary (hgOver)[1:20 , ]
##        GOBPID       Pvalue OddsRatio  ExpCount Count Size
## 1  GO:0022613 1.237776e-11  2.929790  74.64299   118  172
## 2  GO:0042254 1.092750e-10  3.640835  48.60474    82  112
## 3  GO:1903047 1.382925e-08  1.736592 193.11703   250  445
## 4  GO:0034660 2.184857e-08  2.372528  75.94490   112  175
## 5  GO:0007059 2.959786e-08  2.791676  53.81239    84  124
## 6  GO:0051301 4.064662e-08  1.870589 138.00273   185  318
## 7  GO:0007067 4.746980e-08  2.155316  90.69991   129  209
## 8  GO:0016072 5.650406e-08  3.830055  32.11384    55   74
## 9  GO:0006364 7.719281e-08  3.894096  30.81193    53   71
## 10 GO:0010564 1.304123e-07  1.904748 119.77596   162  276
## 11 GO:0000280 1.571832e-07  1.961002 108.05874   148  249
## 12 GO:0070507 4.034994e-07  3.669752  29.51002    50   68
## 13 GO:0048285 4.565099e-07  1.877784 113.70036   153  262
## 14 GO:0022402 6.206269e-07  1.507148 273.83561   332  631
## 15 GO:0098813 1.118107e-06  2.939145  37.75546    60   87
## 16 GO:0000278 2.788264e-06  1.517878 228.70264   279  527
## 17 GO:0032886 3.647483e-06  2.906134  34.71767    55   80
## 18 GO:0006396 7.660386e-06  1.659162 137.13479   175  316
## 19 GO:0000070 7.916781e-06  3.099627  29.07605    47   67
## 20 GO:0003012 8.056638e-06  1.964313  76.37887   105  176
##                                                   Term
## 1                 ribonucleoprotein complex biogenesis
## 2                                  ribosome biogenesis
## 3                           mitotic cell cycle process
## 4                              ncRNA metabolic process
## 5                               chromosome segregation
## 6                                        cell division
## 7                             mitotic nuclear division
## 8                               rRNA metabolic process
## 9                                      rRNA processing
## 10                    regulation of cell cycle process
## 11                                    nuclear division
## 12 regulation of microtubule cytoskeleton organization
## 13                                   organelle fission
## 14                                  cell cycle process
## 15                      nuclear chromosome segregation
## 16                                  mitotic cell cycle
## 17             regulation of microtubule-based process
## 18                                      RNA processing
## 19                mitotic sister chromatid segregation
## 20                               muscle system process

GOstats also has the facility to test for KEGG pathways and chromosome bands which are over-reprsented. The procedure of creating a gene universe and set of selected genes is the same. However, we have to use a different object for the parameters, as not all

keggParams = new ("KEGGHyperGParams" , geneIds = selectedEntrezIds ,
  universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
  pvalueCutoff = 0.05 , testDirection = "over" )
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
keggHgOver = hyperGTest (keggParams)

summary (keggHgOver)
## 
## KEGG.db contains mappings based on older data because the original
##   resource was removed from the the public domain before the most
##   recent update was produced. This package should now be
##   considered deprecated and future versions of Bioconductor may
##   not have it available.  Users who want more current data are
##   encouraged to look at the KEGGREST or reactome.db packages
##    KEGGID       Pvalue OddsRatio  ExpCount Count Size
## 1   03008 6.925084e-06  4.965818 17.303241    31   39
## 2   03013 4.268631e-04  2.446669 28.395062    42   64
## 3   03030 1.273628e-03  4.020631 11.091821    19   25
## 4   03020 3.349777e-03       Inf  3.105710     7    7
## 5   00230 4.485758e-03  1.843030 36.824846    49   83
## 6   00140 6.564118e-03  3.376837  9.760802    16   22
## 7   03040 9.196217e-03  1.999105 23.958333    33   54
## 8   00565 9.201126e-03  3.541901  8.429784    14   19
## 9   04110 1.144609e-02  1.679426 39.043210    50   88
## 10  04975 1.826333e-02  2.949531  8.873457    14   20
## 11  00240 1.837727e-02  1.874590 23.070988    31   52
## 12  04020 2.009462e-02  1.682742 31.944444    41   72
## 13  04610 4.616556e-02  1.818878 17.303241    23   39
##                                   Term
## 1    Ribosome biogenesis in eukaryotes
## 2                        RNA transport
## 3                      DNA replication
## 4                       RNA polymerase
## 5                    Purine metabolism
## 6         Steroid hormone biosynthesis
## 7                          Spliceosome
## 8               Ether lipid metabolism
## 9                           Cell cycle
## 10        Fat digestion and absorption
## 11               Pyrimidine metabolism
## 12           Calcium signaling pathway
## 13 Complement and coagulation cascades
chrParams = new ("ChrMapHyperGParams" , geneIds = selectedEntrezIds ,
universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
pvalueCutoff = 0.05 , testDirection = "over" , conditional = TRUE )
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
chrHgOver = hyperGTest (chrParams)
## Warning: unexpected band notation: Xq13.3-Xq21.2 using Xq
## Warning in makeChrMapToEntrez(chip, univ): dropping invalid items: tdb7990
summary (chrHgOver)
##    ChrMapID      Pvalue OddsRatio   ExpCount Count Size
## 1      2p12 0.006558192       Inf   2.597003     6    6
## 2       5q1 0.006941990  1.863516  30.298368    41   70
## 3    5q35.1 0.010925116  5.910260   4.761172     9   11
## 4      Xp11 0.011416785  2.138478  18.179021    26   42
## 5   6q22.31 0.014104042  9.190211   3.462671     7    8
## 6    3q22.1 0.021018600  5.251937   4.328338     8   10
## 7      6p25 0.021365911  3.152839   7.358175    12   17
## 8     16q13 0.021365911  3.152839   7.358175    12   17
## 9      9p13 0.024369198  1.972750  17.313353    24   40
## 10     2q21 0.026927993  3.939244   5.194006     9   12
## 11       7q 0.028073272  1.290190 107.775622   123  249
## 12     6q22 0.028139668  2.463703   9.955178    15   23
## 13     7q33 0.028893629  7.874884   3.029837     6    7
## 14       20 0.028991119  1.320562  89.163768   103  206
## 15   3p21.1 0.031836512  3.282946   6.059674    10   14
## 16      3p2 0.031950526  1.371733  66.223575    78  153
## 17      8q2 0.033685661  1.351734  70.984747    83  164
## 18   5q12.1 0.035061273       Inf   1.731335     4    4
## 19  8q24.21 0.035061273       Inf   1.731335     4    4
## 20  Xp22.11 0.035061273       Inf   1.731335     4    4
## 21     16q1 0.035452834  2.896575   6.915382    11   16
## 22    10p13 0.035855530  2.889206   6.925341    11   16
## 23     12q2 0.037539376  1.407122  52.805727    63  122
## 24    10q11 0.039115387  2.626745   7.791009    12   18
## 25   8q24.1 0.039115387  2.626745   7.791009    12   18
## 26    10p15 0.039689671  4.594021   3.895504     7    9
## 27      10q 0.040364684  2.456356   8.622919    13   20
## 28  21q22.1 0.041737298  2.439301   8.656676    13   20
## 29     4q13 0.043825908  2.298743   9.522344    14   22
## 30     10q2 0.046678748  1.281341  88.730934   101  205
## 31     20q1 0.047805062  1.357580  57.999732    68  134

A one-off test

Ocassionally, we might want to check if a particular set of genes appear to be up- or down- regulated in an analysis. This can be checked using the geneSetTest function in limma, which computes a p-value to test the hypothesis that the selected genes have more extreme test-statistics than one might expect by chance. Moreover, separate tests can be performed to see if the selected genes are up-regulated (alternative=up) or down-regulated (alterna- tive=down). The default is to test for extreme statistics regardless of sign.

library (limma)

ccGenes <- unlist (mget("04110" , revmap (illuminaHumanv3PATH)))

ccInd <- which (rownames(nsFiltered)  %in% ccGenes)

geneSetTest (index = ccInd , statistics = ttests$statistic)
## [1] 0.0001225606
geneSetTest (index = ccInd , statistics = ttests$statistic ,
  alternative = "down" )
## [1] 6.774774e-18
geneSetTest (index = ccInd , statistics = ttests$statistic ,
  alternative = "up" )
## [1] 1
barcodeplot (ttests$statistic , index = ccInd )

plot (density(ttests$statistic))

lines (density (ttests$statistic[ccInd]) , col = " red " )

Classification

The Bioconductor project has a collection of example datasets. Often these are used as ex- amples to illustrate a particular package or functionality, or to accompany the analysis pre- sented in a publication. For example, several datasets are presented to accompany the genefu package which has functions useful for the classification of breast cancer patients based on expression profiles. An experimental dataset can be installed and loaded as with any other Bioconductor pack- age. The data itself is saved as an object in the package. You will need to see the documen- tation for the package to find out the relevant object name. The full list of datasets available through Bioconductor can be found here

library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
## Features  Samples 
##    22283      344
dim(transbig)
## Features  Samples 
##    22283      198
annotation(vdx)
## [1] "hgu133a"
annotation(transbig)
## [1] "hgu133a"

If we want any classifers to be reproducible and applicable to other datasets, it is sensible to exclude probes that do not have sufficient annotation from the analysis. For this, we can use the genefilter package as before. The nsFilter function performs this annotation-based filtering as well as variance filtering. The output of the function includes details about how many probes were removed at each stage of the filtering.

library (genefilter)
vdx.filt <- nsFilter(vdx)
## 
vdx.filt
## $eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 6221 features, 344 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
##   varLabels: samplename dataset ... e.os (21 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 206797_at 203440_at ... 201130_s_at (6221 total)
##   fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 17420468 
## Annotation: hgu133a 
## 
## $filter.log
## $filter.log$numDupsRemoved
## [1] 7408
## 
## $filter.log$numLowVar
## [1] 6221
## 
## $filter.log$numRemoved.ENTREZID
## [1] 2423
## 
## $filter.log$feature.exclude
## [1] 10
vdx.filt <- vdx.filt[[1]]

Format the vdx data for pamr, and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary er status used in the data package to something more descriptive.

library(pamr)
## Loading required package: survival
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames (vdx.filt)
sI <- sampleNames (vdx.filt)
erStatus <- pData (vdx)$er
erStatus <- gsub (0 , "ER -" , erStatus )
erStatus <- gsub (1 , "ER +" , erStatus )

Fitting the model

train.dat <- list ( x = dat , y = erStatus , genenames = gN ,
              geneid = gI , sampleid = sI )
model <- pamr.train(train.dat ,n.threshold = 100)
## 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model
## Call:
## pamr.train(data = train.dat, n.threshold = 100)
##     threshold nonzero errors
## 1    0.000    6221    41    
## 2    0.130    5803    41    
## 3    0.260    5365    39    
## 4    0.390    4926    39    
## 5    0.519    4501    39    
## 6    0.649    4157    38    
## 7    0.779    3789    38    
## 8    0.909    3440    38    
## 9    1.039    3144    38    
## 10   1.169    2884    36    
## 11   1.299    2624    34    
## 12   1.428    2377    34    
## 13   1.558    2197    34    
## 14   1.688    2000    33    
## 15   1.818    1823    34    
## 16   1.948    1663    33    
## 17   2.078    1538    33    
## 18   2.208    1405    34    
## 19   2.337    1281    33    
## 20   2.467    1164    33    
## 21   2.597    1067    32    
## 22   2.727     994    32    
## 23   2.857     910    33    
## 24   2.987     823    33    
## 25   3.117     753    33    
## 26   3.246     697    32    
## 27   3.376     643    31    
## 28   3.506     577    32    
## 29   3.636     522    33    
## 30   3.766     460    33    
## 31   3.896     398    33    
## 32   4.026     360    33    
## 33   4.155     319    33    
## 34   4.285     284    33    
## 35   4.415     246    33    
## 36   4.545     219    33    
## 37   4.675     191    33    
## 38   4.805     170    34    
## 39   4.935     155    34    
## 40   5.064     132    34    
## 41   5.194     117    35    
## 42   5.324     103    35    
## 43   5.454      89    36    
## 44   5.584      77    36    
## 45   5.714      68    36    
## 46   5.844      63    36    
## 47   5.973      57    36    
## 48   6.103      54    37    
## 49   6.233      49    37    
## 50   6.363      44    37    
## 51   6.493      41    37    
## 52   6.623      37    37    
## 53   6.753      34    36    
## 54   6.882      32    37    
## 55   7.012      28    37    
## 56   7.142      27    37    
## 57   7.272      25    38    
## 58   7.402      23    39    
## 59   7.532      19    41    
## 60   7.661      17    42    
## 61   7.791      17    42    
## 62   7.921      17    43    
## 63   8.051      17    43    
## 64   8.181      16    43    
## 65   8.311      15    44    
## 66   8.441      13    43    
## 67   8.570      11    48    
## 68   8.700       8    48    
## 69   8.830       7    48    
## 70   8.960       6    51    
## 71   9.090       6    54    
## 72   9.220       6    56    
## 73   9.350       5    57    
## 74   9.479       4    58    
## 75   9.609       4    59    
## 76   9.739       4    63    
## 77   9.869       3    66    
## 78   9.999       3    69    
## 79  10.129       3    73    
## 80  10.259       3    79    
## 81  10.388       3    100   
## 82  10.518       3    118   
## 83  10.648       2    134   
## 84  10.778       2    135   
## 85  10.908       2    135   
## 86  11.038       1    135   
## 87  11.168       1    135   
## 88  11.297       1    135   
## 89  11.427       1    135   
## 90  11.557       1    135   
## 91  11.687       1    135   
## 92  11.817       1    135   
## 93  11.947       1    135   
## 94  12.077       1    135   
## 95  12.206       1    135   
## 96  12.336       1    135   
## 97  12.466       1    135   
## 98  12.596       1    135   
## 99  12.726       1    135   
## 100 12.856       0    135

We can perform cross-validation using the pamr.cv function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suit- able theshold. The pamr.plotcv function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. In the plot produced by this function there are two panels; the top one shows the errors in the whole dataset and the bot- tom one considers each class separately. In each panel, the x axis corresponds to the thresh- old (and number of genes at each threshold) whereas the y-axis is the number of misclassifi- cations.

model.cv <- pamr.cv(model , train.dat , nfold = 10)
## 12Fold 1 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 2 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 3 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 4 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 5 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 6 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 7 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 8 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 9 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 10 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model.cv
## Call:
## pamr.cv(fit = model, data = train.dat, nfold = 10)
##     threshold nonzero errors
## 1    0.000    6221    43    
## 2    0.130    5803    43    
## 3    0.260    5365    43    
## 4    0.390    4926    43    
## 5    0.519    4501    42    
## 6    0.649    4157    42    
## 7    0.779    3789    40    
## 8    0.909    3440    39    
## 9    1.039    3144    39    
## 10   1.169    2884    39    
## 11   1.299    2624    40    
## 12   1.428    2377    40    
## 13   1.558    2197    39    
## 14   1.688    2000    37    
## 15   1.818    1823    38    
## 16   1.948    1663    37    
## 17   2.078    1538    38    
## 18   2.208    1405    38    
## 19   2.337    1281    38    
## 20   2.467    1164    38    
## 21   2.597    1067    38    
## 22   2.727     994    37    
## 23   2.857     910    36    
## 24   2.987     823    36    
## 25   3.117     753    36    
## 26   3.246     697    36    
## 27   3.376     643    36    
## 28   3.506     577    35    
## 29   3.636     522    34    
## 30   3.766     460    35    
## 31   3.896     398    34    
## 32   4.026     360    34    
## 33   4.155     319    34    
## 34   4.285     284    34    
## 35   4.415     246    34    
## 36   4.545     219    34    
## 37   4.675     191    34    
## 38   4.805     170    34    
## 39   4.935     155    35    
## 40   5.064     132    35    
## 41   5.194     117    36    
## 42   5.324     103    36    
## 43   5.454      89    36    
## 44   5.584      77    36    
## 45   5.714      68    37    
## 46   5.844      63    37    
## 47   5.973      57    38    
## 48   6.103      54    38    
## 49   6.233      49    38    
## 50   6.363      44    38    
## 51   6.493      41    39    
## 52   6.623      37    39    
## 53   6.753      34    40    
## 54   6.882      32    38    
## 55   7.012      28    38    
## 56   7.142      27    38    
## 57   7.272      25    39    
## 58   7.402      23    40    
## 59   7.532      19    41    
## 60   7.661      17    42    
## 61   7.791      17    42    
## 62   7.921      17    42    
## 63   8.051      17    43    
## 64   8.181      16    44    
## 65   8.311      15    44    
## 66   8.441      13    45    
## 67   8.570      11    47    
## 68   8.700       8    49    
## 69   8.830       7    49    
## 70   8.960       6    52    
## 71   9.090       6    55    
## 72   9.220       6    58    
## 73   9.350       5    58    
## 74   9.479       4    61    
## 75   9.609       4    61    
## 76   9.739       4    64    
## 77   9.869       3    66    
## 78   9.999       3    69    
## 79  10.129       3    75    
## 80  10.259       3    88    
## 81  10.388       3    109   
## 82  10.518       3    117   
## 83  10.648       2    126   
## 84  10.778       2    134   
## 85  10.908       2    135   
## 86  11.038       1    135   
## 87  11.168       1    135   
## 88  11.297       1    135   
## 89  11.427       1    135   
## 90  11.557       1    135   
## 91  11.687       1    135   
## 92  11.817       1    135   
## 93  11.947       1    135   
## 94  12.077       1    135   
## 95  12.206       1    135   
## 96  12.336       1    135   
## 97  12.466       1    135   
## 98  12.596       1    135   
## 99  12.726       1    135   
## 100 12.856       0    135
pamr.plotcv(model.cv)

In the following sections, feel free to experiment with different values of the threshold (which we will call Delta) The misclassifications can easily be visualised as a ’confusion table’. This simply tabulates the classes assigned to each sample against the original label assigned to the sample. e.g. Misclassifications are samples that we thought were ’ER+’ but have been assigned to the ’ER-’ group by the classifier, or ’ER-’ samples assigned as ’ER+’ by the classifier.

Delta <- 8
pamr.confusion(model.cv , Delta)
##      ER - ER + Class Error rate
## ER -  106   29       0.21481481
## ER +   14  195       0.06698565
## Overall error rate= 0.125

A visual representation of the class separation can be obtained using the pamr.plotcvprob function. For each sample there are two circles representing the probabilty of that sample being classified ER- (red) or ER+ (green).

pamr.plotcvprob(model , train.dat , Delta )

There are a couple of ways of extract the details of the genes that have been used in the classifier. We can list their names using the pamr.listgenes function, which in our case these are just returns the microarray probe names. We can however, use these IDs to query the featureData stored with the original vdx object. We can also plot the expression values for each gene, coloured according to the class label.

pamr.listgenes(model , train.dat , Delta )
##       id          ER --score ER +-score
##  [1,] 205225_at   -0.3257    0.2104    
##  [2,] 209173_at   -0.202     0.1305    
##  [3,] 209603_at   -0.1753    0.1133    
##  [4,] 204508_s_at -0.1184    0.0765    
##  [5,] 205009_at   -0.0987    0.0638    
##  [6,] 214440_at   -0.083     0.0536    
##  [7,] 205186_at   -0.0583    0.0376    
##  [8,] 219197_s_at -0.0507    0.0327    
##  [9,] 209459_s_at -0.0453    0.0292    
## [10,] 212956_at   -0.0425    0.0274    
## [11,] 218976_at   -0.0402    0.026     
## [12,] 203929_s_at -0.035     0.0226    
## [13,] 204623_at   -0.0325    0.021     
## [14,] 215729_s_at 0.0295     -0.0191   
## [15,] 209800_at   0.0211     -0.0136   
## [16,] 218211_s_at -0.018     0.0116    
## [17,] 205862_at   -0.0109    0.0071
classifierGenes <- pamr.listgenes(model , train.dat , Delta )[,1]
##       id          ER --score ER +-score
##  [1,] 205225_at   -0.3257    0.2104    
##  [2,] 209173_at   -0.202     0.1305    
##  [3,] 209603_at   -0.1753    0.1133    
##  [4,] 204508_s_at -0.1184    0.0765    
##  [5,] 205009_at   -0.0987    0.0638    
##  [6,] 214440_at   -0.083     0.0536    
##  [7,] 205186_at   -0.0583    0.0376    
##  [8,] 219197_s_at -0.0507    0.0327    
##  [9,] 209459_s_at -0.0453    0.0292    
## [10,] 212956_at   -0.0425    0.0274    
## [11,] 218976_at   -0.0402    0.026     
## [12,] 203929_s_at -0.035     0.0226    
## [13,] 204623_at   -0.0325    0.021     
## [14,] 215729_s_at 0.0295     -0.0191   
## [15,] 209800_at   0.0211     -0.0136   
## [16,] 218211_s_at -0.018     0.0116    
## [17,] 205862_at   -0.0109    0.0071
pamr.geneplot(model , train.dat ,Delta)

You may get an error message Error in plot.new(): Figure margins too large when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by decreasing the threshold. Alternatively, the fol- lowing code will write the plots to a pdf.

pdf ("classifierProfiles.pdf")
for (i in 1: length (classifierGenes)) {
  Symbol <- fData(vdx.filt)[classifierGenes[i] , "Gene.symbol"]
  boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus ,
  main = Symbol )
}
dev.off()
## png 
##   2

Use the genes identified by the classifier to produce a heatmap to confirm that they separate the samples as expected.

symbols <- fData(vdx.filt)[classifierGenes , "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ] , labRow = symbols )

Testing the model

We can now test the classifier on an external dataset. We choose the transbig dataset for simplicity as it was generated on the same microarray platform

library (breastCancerTRANSBIG)
data (transbig)
pData (transbig)[1:4, ]
##                samplename  dataset  series   id         filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
## VDXGUYU_4002 VDXGUYU_4002 TRANSBIG VDXGUYU 4002 GSM177885.CEL.gz  3.0  57  0     3  NA   NA            NA      1    723    0   723     1         0      1  937    1
## VDXGUYU_4008 VDXGUYU_4008 TRANSBIG VDXGUYU 4008 GSM177886.CEL.gz  3.0  57  1     3  NA   NA            NA      0   6591    0   183     1         0      1 6591    0
## VDXGUYU_4011 VDXGUYU_4011 TRANSBIG VDXGUYU 4011 GSM177887.CEL.gz  2.5  48  0     3  NA   NA            NA      1    524    0   524     1         0      1  922    1
## VDXGUYU_4014 VDXGUYU_4014 TRANSBIG VDXGUYU 4014 GSM177888.CEL.gz  1.8  42  1     3  NA   NA            NA      1   6255    0  2192     1         0      1 6255    1
transbig.filt <- transbig [featureNames(vdx.filt) , ]
predClass <- pamr.predict(model ,exprs(transbig.filt) ,Delta )
table (predClass, pData(transbig)$ er)
##          
## predClass   0   1
##      ER -  52  11
##      ER +  12 123
boxplot (pamr.predict(model , exprs(transbig.filt), Delta ,
                           type = "posterior" )[, 1] ~ pData(transbig)$er)

Make a heatmap of the transbig data using the genes involved in the vxd classifier

erLab <- as.factor(pData(transbig)$er)
levels (erLab) <- c ("blue" , "yellow")

heatmap (exprs(transbig.filt)[classifierGenes , ] , labRow = symbols ,
  ColSideColors = as.character (erLab))

Survival Analysis

An attractive feature of the vdx dataset is that it includes survival data for each breast can- cer patient. We are not explicitly covering survival analysis in this course, but for your ref- erence, here are the commands to create survival curves when patients are grouped by ER status and tumour grade.

library (survival)
par (mfrow = c (1 , 2))
plot (survfit (Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData(vdx)$er) , col = c("cyan" , "salmon"))

plot (survfit(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData (vdx)$grade) , col = c("blue" , "yellow" , "orange"))

survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData (vdx)$er)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~ 
##     pData(vdx)$er)
## 
##                   N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$er=0 135       38     45.5     1.244      2.04
## pData(vdx)$er=1 209       80     72.5     0.781      2.04
## 
##  Chisq= 2  on 1 degrees of freedom, p= 0.154
survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
  pData(vdx)$grade)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~ 
##     pData(vdx)$grade)
## 
## n=197, 147 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$grade=1   7        0     3.06      3.06      3.21
## pData(vdx)$grade=2  42       10    16.69      2.68      3.53
## pData(vdx)$grade=3 148       61    51.26      1.85      6.72
## 
##  Chisq= 7.7  on 2 degrees of freedom, p= 0.0218